Model Training on Season Average Stats

The goal of this notebook is to train a model that can predict inflation-adjusted salary from season-average stats (as output by the 3-feature-engineering notebook).

Load Dataset

Train-Test Split

To start off, we conduct an 80-20 train-test split:
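As a sketch (using synthetic data in place of the notebook's feature-engineered dataset), the 80-20 split can be done with scikit-learn's `train_test_split`:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the season-average stats dataset
rng = np.random.default_rng(42)
X = rng.normal(size=(100, 5))   # per-season average stats (placeholder)
y = rng.normal(size=100)        # inflation-adjusted salary (placeholder)

# 80-20 split with a fixed seed for reproducibility
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
```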

Base (Dummy) model

We construct a base model to compare model results against. This baseline simply predicts the mean of the training targets every time.
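A minimal sketch of such a baseline with scikit-learn's `DummyRegressor`, again on synthetic stand-in data:

```python
import numpy as np
from sklearn.dummy import DummyRegressor

rng = np.random.default_rng(0)
X_train = rng.normal(size=(80, 5))
y_train = rng.normal(size=80)

# Baseline: always predict the mean of the training targets
dummy = DummyRegressor(strategy="mean")
dummy.fit(X_train, y_train)
preds = dummy.predict(X_train)
```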

Data Scaling

We scale/normalize the data so that models sensitive to feature magnitudes perform better. The resulting preprocessor will be reused in the model pipelines.
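A hedged sketch of such a preprocessor, assuming a mix of numeric stat columns and a categorical team column (the column names here are illustrative, not the notebook's actual schema):

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder

# Toy frame; column names are placeholders for the real feature set
df = pd.DataFrame({
    "trb_per_game": [7.1, 10.2, 4.3],
    "stl_per_game": [1.2, 0.8, 1.5],
    "team": ["LAL", "BOS", "LAL"],
})

# Scale numeric stats, one-hot encode the categorical team column
preprocessor = ColumnTransformer([
    ("num", StandardScaler(), ["trb_per_game", "stl_per_game"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["team"]),
])

X = preprocessor.fit_transform(df)
```

Wrapping this in a pipeline with each model keeps the scaling fitted only on the training folds during cross-validation.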

Hyperparameter Optimization and Model Selection

We run different combinations and models on the dataset:

We use HalvingGridSearchCV instead of GridSearchCV in this particular instance, because we have a lot of models and model parameters that we want to test. Once we have the best-performing models from this hyperparameter tuning process, we will re-tune the best models using regular GridSearchCV.
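A small sketch of successive halving on synthetic data (the model and grid here are placeholders for the notebook's much larger search); note that the halving searches still require the experimental-feature import:

```python
import numpy as np
from sklearn.experimental import enable_halving_search_cv  # noqa: F401
from sklearn.model_selection import HalvingGridSearchCV
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = X[:, 0] * 2.0 + rng.normal(scale=0.1, size=200)

# Successive halving: all candidates start on a small sample budget,
# and only the top 1/factor survive each round to refit on more data
search = HalvingGridSearchCV(
    RandomForestRegressor(random_state=0),
    param_grid={"n_estimators": [10, 50], "max_depth": [2, None]},
    factor=3,
    random_state=0,
)
search.fit(X, y)
best = search.best_params_
```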

Save best hyperparameters

Save them for later use (since running the training takes a long time).
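One lightweight way to persist the tuned parameters is a JSON file keyed by model name (the parameter values below are hypothetical):

```python
import json
import os
import tempfile

# Hypothetical best-parameter dicts, one entry per tuned model
best_params = {
    "RandomForestRegressor": {"n_estimators": 200, "max_depth": None},
    "AdaBoostRegressor": {"n_estimators": 100, "learning_rate": 0.1},
}

path = os.path.join(tempfile.gettempdir(), "best_params.json")
with open(path, "w") as f:
    json.dump(best_params, f, indent=2)

# Reload later to skip the expensive search
with open(path) as f:
    reloaded = json.load(f)
```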

Save Training Results for each model

Save the CV results as CSV files for each model, in case we want to refer to them later on.
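Since `cv_results_` is a dict of equal-length arrays, pandas can write it straight to CSV; a sketch using a hypothetical fragment in place of a fitted search object:

```python
import os
import tempfile
import pandas as pd

# Hypothetical slice of a search's cv_results_ dict
cv_results = {
    "param_n_estimators": [10, 50],
    "mean_test_score": [0.71, 0.78],
    "std_test_score": [0.05, 0.03],
}

path = os.path.join(tempfile.gettempdir(), "cv_results_rf.csv")
pd.DataFrame(cv_results).to_csv(path, index=False)

# Read back to confirm the round trip
df = pd.read_csv(path)
```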

Plot learning curves for each model

The interesting thing about using HalvingGridSearchCV is that it evaluates hyperparameter candidates on subsets of the training data — as the search progresses, poorly performing candidates are pruned out and the subset size grows. It is therefore possible to draw learning curves for each of the models, showing their average performance as the size of the training dataset increases.
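The data for such a curve can be pulled from the search's `cv_results_`, which records the halving iteration and resource count for every candidate; a sketch on synthetic data:

```python
import numpy as np
import pandas as pd
from sklearn.experimental import enable_halving_search_cv  # noqa: F401
from sklearn.model_selection import HalvingGridSearchCV
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 4))
y = X[:, 0] + rng.normal(scale=0.1, size=300)

search = HalvingGridSearchCV(
    DecisionTreeRegressor(random_state=0),
    param_grid={"max_depth": [2, 4, 6, None]},
    factor=2,
    random_state=0,
).fit(X, y)

# One point per halving iteration: resources (samples) used vs. mean CV
# score — exactly the data a learning curve plots
results = pd.DataFrame(search.cv_results_)
curve = results.groupby("iter").agg(
    n_resources=("n_resources", "max"),
    mean_score=("mean_test_score", "mean"),
)
```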

Evaluate Models on Test Set

We evaluate the best-performing version of each of the 9 models on the test set and save the results.
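A sketch of the evaluation loop, with a few stand-in models and synthetic data in place of the nine tuned candidates:

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.linear_model import Ridge
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = X[:, 0] * 2.0 + rng.normal(scale=0.1, size=200)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Placeholder models; the notebook compares nine tuned candidates
models = {
    "dummy": DummyRegressor(),
    "ridge": Ridge(),
    "random_forest": RandomForestRegressor(random_state=0),
}

# R^2 on the held-out test set for each model
results = {
    name: r2_score(y_test, m.fit(X_train, y_train).predict(X_test))
    for name, m in models.items()
}
```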

Based on the R^2 values, AdaBoostRegressor and RandomForestRegressor seem to perform the best.

Dimensionality reduction with PCA

The best model that we currently have is the AdaBoostRegressor. While it performs reasonably well, we want to try reducing the number of dimensions to remove correlated variables and see whether performance improves. The hyperparameters used for this analysis are the same as the optimal hyperparameters found via HalvingGridSearchCV.
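A sketch of this comparison, scoring a PCA + AdaBoost pipeline for an increasing number of components on synthetic data with deliberately correlated columns (the grids and sizes here are illustrative):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.ensemble import AdaBoostRegressor
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(150, 8))
# Make the last four columns near-copies of the first four,
# so PCA has correlated variables to collapse
X[:, 4:] = X[:, :4] + rng.normal(scale=0.05, size=(150, 4))
y = X[:, 0] + rng.normal(scale=0.1, size=150)

# Mean CV score for each number of retained components
scores = {}
for n in (2, 4, 6):
    pipe = make_pipeline(
        PCA(n_components=n), AdaBoostRegressor(random_state=0)
    )
    scores[n] = cross_val_score(pipe, X, y, cv=3).mean()
```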

There is no clear "plateau" in the graph: the test score continues to increase as more components are retained. Using PCA does not have a noticeable positive effect on the test score, so we will proceed with the original dataset.

Final Training

To train the final model, we do hyperparameter tuning on AdaBoostRegressor. Additionally, we tune the parameters of the base estimator (DecisionTreeRegressor) using the syntax for nested parameter tuning in GridSearchCV. We use GridSearchCV instead of HalvingGridSearchCV because we are only testing one model, and we want to get the best combination of parameters possible.

Plotting Learning Curve and Feature Importance for Best Estimator

Using scikit-plot, we can plot the learning curve of the AdaBoostRegressor based on different-sized datasets, and see which features the model considers to be the most important.

According to the graph below, it looks like total rebounds have the highest importance, followed by the team, the particular season, total steals, inflation-adjusted team payroll, field goal attempts, and so on.

Evaluate model on test set

We evaluate the performance of our final model on the 20% test set:

Save best hyperparameters and model for final model
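A sketch of persisting the final model with joblib alongside its tuned parameters (the parameter values below are hypothetical):

```python
import json
import os
import tempfile
import joblib
import numpy as np
from sklearn.ensemble import AdaBoostRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(60, 3))
y = X[:, 0] + rng.normal(scale=0.1, size=60)

# Hypothetical tuned parameters for the final model
params = {"n_estimators": 50, "learning_rate": 0.5}
model = AdaBoostRegressor(**params, random_state=0).fit(X, y)

tmp = tempfile.gettempdir()
joblib.dump(model, os.path.join(tmp, "final_model.joblib"))
with open(os.path.join(tmp, "final_params.json"), "w") as f:
    json.dump(params, f)

# Reload and confirm the persisted model predicts identically
reloaded = joblib.load(os.path.join(tmp, "final_model.joblib"))
```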